SAMPLING FOR AREA ESTIMATION: A COMPARISON
OF FULL-FRAME SAMPLING WITH THE SAMPLE
SEGMENT APPROACH

MARILYN M. HIXSON, MARVIN E. BAUER 
Purdue University 

BARBARA J. DAVIS 

Indiana Bell Telephone Company 


ABSTRACT 

The objective of this investigation 
was to evaluate the effect of sampling on 
the accuracy (precision and bias) of crop 
area estimates made from classifications 
of Landsat MSS data. Full-frame classi- 
fications of wheat and non-wheat for 
eighty counties in Kansas were repeti- 
tively sampled to simulate alternative 
sampling plans. Four sampling schemes 
involving different numbers of samples 
and different size sampling units were 
evaluated. The precision of the wheat 
area estimates increased as the segment 
size decreased and the number of segments 
was increased. Although the average 
bias associated with the various sampling 
schemes was not significantly different, 
the maximum absolute bias was directly 
related to sampling unit size. 


I. INTRODUCTION 

Accurate and timely crop production
information is essential for planning
the production, storage, transportation,
and processing of grain crops, making
marketing decisions, and determining national
agricultural policies. Although most
countries of the world gather crop
production data, relatively few countries
have reliable inventory systems. The
synoptic view of the earth provided by
satellite remote sensing, along with
computer processing of the data, provides
the opportunity to identify and estimate
the area of crops.

The most comprehensive investigation
of the use of Landsat MSS data for crop
surveys has been the Large Area Crop
Inventory Experiment (LACIE). The purpose
of LACIE was to assimilate current remote
sensing technology into an experimental
system and evaluate its potential for
determining the production of wheat in
various regions of the world. In LACIE,
area estimates were made from
classifications of Landsat MSS data. Yield was
estimated for fairly broad geographic
regions using statistical regression
models developed from historical weather
and wheat yield data.

For the area estimation phase of
LACIE, samples, five by six nautical miles
in size, were selected for analysis to
represent about two percent of the
agricultural land area. Segments were
allocated to political units according to the
historical area of wheat. The sample
segments were used both for training the
classifier and for aggregation to obtain
area estimates. The LACIE method was
generally successful in obtaining unbiased
and precise area estimates. Six hundred
segments were selected in the United
States, and 1900 in the Soviet Union, to
achieve a sampling error of two percent.

This research was sponsored by the
National Aeronautics and Space
Administration, Johnson Space Center (Contract
NAS9-14970).



An alternative sampling plan for 
obtaining area estimates was used in 
another investigation at LARS.¹ A
systematic sample of pixels spread throughout
a Landsat full-frame was classified and 
used to make estimates, while training 
data were obtained separately. The class- 
ifications were performed on a county basis 
using every other line and every other 
column of Landsat data. Training statis- 
tics were developed using photointerpre- 
tation from aerial infrared photography 
taken along several flightlines dispersed 
throughout the state and were extended to 
counties lacking reference data, but 
known to have similar land use, crops, and 
soils. The pixel sampling approach was 
demonstrated to have the capability to 
produce unbiased and precise area estimates 
for small (e.g., county) as well as
large (e.g., state) geographic areas. 

The goal of any estimation procedure
is to obtain an accurate estimate. Bias
and precision are both components of
accuracy. Bias refers to the size of
deviations from the true parameter, while
precision refers to the size of deviations
from the mean of all estimates of the
parameter obtained through repeated
applications of the sampling procedure.
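
The two components can be illustrated with a minimal Python sketch. The function name `bias_and_precision` and the replicate values are hypothetical and serve only to show the definitions; they are not data from this study.

```python
from statistics import mean, stdev

def bias_and_precision(estimates, true_value):
    """Return (bias, standard deviation) for a set of repeated estimates."""
    avg = mean(estimates)
    bias = avg - true_value    # deviation of the average estimate from the true parameter
    spread = stdev(estimates)  # sample standard deviation about the mean of the estimates
    return bias, spread

# Hypothetical replicated estimates (thousand hectares), for illustration only.
replicates = [98.0, 102.0, 101.0, 103.0]
bias, spread = bias_and_precision(replicates, true_value=100.0)
print(bias, round(spread, 2))
```

A small standard deviation with a large bias describes a procedure that is precise but not accurate, which is why the two quantities are reported separately.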

Numerous aspects of the crop inven- 
tory problem using remote sensing may 
affect the bias and precision of the 
estimates. Choices involving the spectral 
features to be measured, the sensor to 
be utilized, the timing of the crop 
observation, and the analysis methods 
used are all important aspects to be con- 
sidered in the design of a remote sensing 
system. One consideration which has not 
been extensively researched is the choice 
of sampling method for area estimation. 



Figure 1. Landsat Full-Frame 
Classifications of Kansas. Alternative 
sampling schemes were simulated using 
these data. 


II. OBJECTIVES


The overall objective of this in- 
vestigation was to evaluate the effect 
of sampling on the accuracy of crop area 
estimates made from classifications of 
Landsat MSS data. The specific objectives 
were to assess the precision and bias 
associated with alternative sampling 
schemes involving different numbers of 
samples and different sampling unit sizes. 


III. EXPERIMENTAL APPROACH 

Ideally, a study of bias and pre- 
cision of a sampling scheme would be con- 
ducted by sampling repetitively from the 
population of interest. In this case, 
however, the population of interest is 
the true distribution of crops in a state 
(or other region), and this truth is not
generally known for large regions. 

An alternative approach to actually 
conducting the experiment is to simulate 
its occurrence. Simulated data are used 
instead of truth and they are repetitively 
sampled to determine a variance. The 
estimates made are compared for bias not
with truth, but with the mean of the
distribution from which the data were
generated.

The approach taken in this study is
a combination of the two approaches
described above. Full-frame classifications
of Kansas into wheat and non-wheat made in
another investigation¹ were used in this
study as simulated ground truth. Eighty
counties comprising seven crop reporting
districts were included. The Landsat
frames used in these classifications are
shown in Figure 1. The estimates of
wheat area obtained in that study did not
differ significantly from the USDA/SRS
estimates at the state level. The
full-frame classifications were considered to
have negligible sampling error and were
repetitively sampled to simulate
alternative sampling plans.

Four sampling schemes were selected 
for testing. The total number of pixels 
in the sample was held constant, and the 
sampling unit size and number of samples 
were varied. Two types of samples were 
considered: cluster (segment) sampling
and point (pixel) sampling of full-frames.


Sampling Unit Size    No. of Samples

5 x 6 nm                          75
4 x 4 nm                         137
2 x 2 nm                         560
Pixel                        427,587
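
As a quick arithmetic check on the statement that the total sample size was held roughly constant, the sketch below multiplies each segment's area by the number of segments. The dictionary layout is an illustrative choice; the pixel scheme is omitted because converting pixels to area would require a pixel size that is not stated here.

```python
# Segment dimensions (nautical miles) and segment counts from the table above.
schemes = {
    "5 x 6 nm": (5, 6, 75),
    "4 x 4 nm": (4, 4, 137),
    "2 x 2 nm": (2, 2, 560),
}

# Total sampled area per scheme, in square nautical miles.
total_area = {name: h * w * n for name, (h, w, n) in schemes.items()}
for name, area in total_area.items():
    print(f"{name}: {area} sq nm")
```

The three segment schemes sample 2250, 2192, and 2240 square nautical miles respectively, which agree to within about three percent.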


Procedures similar to those followed 
in LACIE were used to determine the allo- 
cation (number) of samples, location 
(geographic placement) of segments, and 
the aggregated area estimate of wheat.⁵,⁷


A. SAMPLE SEGMENT ALLOCATION

Based on 84 sample segments which
were allocated to the state of Kansas in
LACIE, the number of segments per county
was computed. The threshold value for each
county was computed based on the total
number of acres in the county and the
standard deviation of the proportion of
wheat in that county. For county k,


where A_k is the total land area in
county k, and p_k is the historical
proportion of wheat in county k. The
proportional number of sample segments
allotted to each county was computed by:


where t_k is as defined above, and n is
the number of counties in the state.

The number of sample segments allotted to
each crop reporting district (CRD) in
the state was computed similarly.
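
The expressions for the threshold and the proportional allocation are not reproduced in the text above. One form consistent with the definitions given is stated here as an assumption, not as the documented LACIE formula. It takes the threshold as the county land area times the binomial standard deviation of the wheat proportion, and allots the N = 84 state segments in proportion to it:

```latex
t_k = A_k \sqrt{p_k \,(1 - p_k)}, \qquad
N_k = N \, \frac{t_k}{\sum_{i=1}^{n} t_i}, \qquad N = 84
```

Under this assumed form, counties with a large land area and a wheat proportion near one half receive the largest share of segments. The LACIE documentation cited above remains the authority for the exact expressions.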

The type of sample was then deter- 
mined by the following procedure: 

1. Stratified sample segment - all
counties with N_k ≥ 0.5 will have at
least one sample segment; the actual
number of segments is the rounded
value of N_k.

2. No sample segments allotted if
N_k < 0.1.

3. Probability proportional to size
(PPS) sampling is done otherwise,
spreading remaining segments for
the CRD among the remaining counties.
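
The three-way rule above can be sketched in Python as follows. This is a minimal illustration, not the LACIE code: the function name `allocation_type` is hypothetical, the thresholds are assumed to read 0.5 and 0.1, and rounding halves upward is an assumption because the text says only "rounded".

```python
def allocation_type(n_k, lower=0.1, upper=0.5):
    """Classify a county by its proportional allocation N_k.

    Returns (kind, segments); segments is None for PPS counties because
    their count depends on the draw within the crop reporting district.
    """
    if n_k >= upper:
        # Round half up (assumed); Python's built-in round() rounds halves to even.
        return ("stratified", int(n_k + 0.5))
    if n_k < lower:
        return ("none", 0)
    return ("pps", None)

print(allocation_type(1.4))   # ('stratified', 1)
print(allocation_type(0.3))   # ('pps', None)
print(allocation_type(0.05))  # ('none', 0)
```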

Allocations strictly according to 
the LACIE procedure produced county all- 
ocations which did not add to the total 
number allocated for the crop reporting 
district. It was found that LACIE had 
also encountered this problem and had 
adjusted its allocations to achieve 
consistency. Determination of the number 
of segments per county followed the 
scheme given below for 5 x 6 nm segments 
because more consistent results were ob- 
tained than with the method given in the 
LACIE documentation: 


Value of N_k    Segments Allocated

0.0 - 0.3
0.3 - 0.6
0.6 - 1.6
1.6 - 2.6

Two counties received two sample segments,
seven counties received no sample
segments, and the remainder of the counties
received one segment in the 5 x 6 nm
segment allocation. The criteria were
generalized for other segment sizes.

B. SAMPLE SEGMENT LOCATION 

The selection of sample segments was 
computer-implemented. This allowed a 
large number of segments to be chosen with 
little personnel time and also facilitated 
choice of any segment size or number of
segments. The greater number of samples 
which could be taken through automated 
selection permitted statistical tests of 
precision. The description of the proce- 
dure which was implemented follows. 

A grid, spaced six nautical miles in 
the east-west direction and five nautical 
miles in the north-south direction, was 
defined to cover the state of Kansas. To 
select a sample for a given county, the 
number of segments whose centers were 
inside the county boundaries but which did 
not fall entirely in the defined
nonagricultural areas was determined and a
sample was randomly selected from these.

The selected segment was then 
checked against a set of constraints. The 
constraints for the 5 x 6 nm segments are 
given here. The new segment was discarded 
if there was another sample segment within 
a 12 x 10.5 nm rectangle centered about 
the new segment. Then two extended
rectangles were defined: one, running in the
east-west direction, was 10.5 x 80 nm,
and the other, running north-south, was
12 x 100 nm. Only four sample segments
were permitted to fall in the east-west
extended rectangle, and no more than
eight sample segments were permitted to
fall in the north-south extended rectangle.
If the new segment caused any of these 
constraints to fail, it was discarded, and 
a new random draw was made. 
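
The draw-and-reject procedure for 5 x 6 nm segments can be sketched as below. This is an illustration under stated assumptions, not the program used in the study: segments are represented by their center coordinates (east, north) in nautical miles, "within a rectangle" is interpreted as the center falling inside it, the candidate itself is counted toward the extended-rectangle limits, and the names `violates_constraints` and `draw_segment` are hypothetical.

```python
import random

def violates_constraints(new, chosen, window=(12.0, 10.5),
                         ew_len=80.0, ns_len=100.0, ew_max=4, ns_max=8):
    """Check a candidate center (x, y) against the centers already chosen."""
    half_w, half_h = window[0] / 2, window[1] / 2
    x, y = new
    ew_count = ns_count = 1  # the candidate itself is counted (an assumption)
    for cx, cy in chosen:
        dx, dy = abs(cx - x), abs(cy - y)
        if dx <= half_w and dy <= half_h:
            return True  # another segment lies inside the local window
        if dx <= ew_len / 2 and dy <= half_h:
            ew_count += 1  # falls in the east-west extended rectangle
        if dx <= half_w and dy <= ns_len / 2:
            ns_count += 1  # falls in the north-south extended rectangle
    return ew_count > ew_max or ns_count > ns_max

def draw_segment(candidates, chosen, rng, max_tries=1000):
    """Draw random candidates until one satisfies the constraints."""
    for _ in range(max_tries):
        cand = rng.choice(candidates)
        if not violates_constraints(cand, chosen):
            return cand
    return None  # no admissible segment found within max_tries

picked = draw_segment([(0, 0), (3, 2), (30, 30)], [(0, 0)], random.Random(0))
print(picked)
```

In this example only the candidate at (30, 30) is far enough from the existing segment at (0, 0), so it is the one returned.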


Table 1. Location Constraints for the
Different Segment Sizes.

Segment    Rectangle     Segments Allowed in
Size       Considered    Extended Rectangle
(nm)       (nm)          E-W        N-S

5 x 6      10.5 x 12
4 x 4      8.4 x 8
2 x 2      4.2 x 4

The location of sample segments diff- 
ered in two respects from the location of 
the LACIE segments: first, in the
definition of nonagricultural areas and second,
in the number of segments permitted in a
window or extended rectangle about a given
segment.

Nonagricultural areas of at least 
2x2 miles in size were excluded from 
consideration as sample segments. The 
boundaries of urban areas, federal lands, 
reservoirs, etc., appearing on county maps 
prepared by the State Highway Commission 
of Kansas, Department of Planning and 
Development were found using a coordinate 
digitizer. The boundary definitions of 
nonagricultural areas were somewhat more 
crude than those defined by LACIE. The
reasons for this include: (1) constraints 
of time (including computer time) and 
resources (including detailed maps) and 
(2) the belief that only major nonagri- 
cultural areas needed to be excluded be- 
cause experience in another investigation* 
indicated that even when few nonagricul- 
tural areas are excluded, estimates of 
high accuracy can be obtained. The
constraint that a sample segment not fall
within a nonagricultural area was ignored 
with the pixel sampling method due to 
excessively high costs of computer check- 
ing for each of the nearly four million 
samples. 

The constraints concerning the num- 
ber of segments permitted in a given size 
rectangle centered about the sample seg- 
ment and its east-west and north-south 
extensions to 80 nm and 100 nm,
respectively, were adjusted by number and size
of the rectangle to be relatively
consistent with the constraints for the LACIE
5 x 6 nm segments (Table 1). This type
of constraint was not feasible to use for 
the pixel selection procedure. 


C. AREA ESTIMATION PROCEDURE 

Wheat area estimates were calculated 
for each replication for the counties and 
were aggregated to obtain estimates for 
the crop reporting districts and state. 

For each crop reporting district, the 
area estimate was computed by 

A_j = A_{1j} + A_{2j} + A_{3j}

where A_{1j} is the estimate of the area in
the counties within the crop reporting
district which had no segments allocated;
A_{2j} is the estimate for those counties
which were allocated segments with
probability proportional to size; and A_{3j} is
the estimate for counties allocated one or
more segments.

For the m_3 counties falling into
class 3, A_{3j} is simply the sum of the
areal proportion of wheat in each county
as estimated from the sample segments
multiplied by the area of the counties
containing the segments:

A_{3j} = \sum_{k=1}^{m_3} p_{3k} A_k

where p_{3k} is the wheat areal proportion
in the k-th county estimated from the
segments and weighted according to the
nonagricultural area, and A_k is the total
land area in the k-th county.
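
The class 3 estimate and the district total can be sketched in a few lines of Python. The function names and the county values are hypothetical. The estimates for the other two classes are passed in as numbers because their formulas are not reproduced in this sketch.

```python
def class3_area(counties):
    """Sum of (wheat proportion x land area) over counties that have their own segments."""
    return sum(p * area for p, area in counties)

def district_area(a1, a2, a3):
    """District estimate as the sum of the three class estimates."""
    return a1 + a2 + a3

# Hypothetical (wheat proportion, land area in thousand hectares) pairs.
counties = [(0.25, 200.0), (0.5, 100.0)]
a3 = class3_area(counties)
total = district_area(10.0, 20.0, a3)
print(a3, total)  # 100.0 130.0
```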

For that set of counties in a crop 
reporting district to which segments were 
allocated with probability proportional to 
size, the area of wheat was estimated by: 

A „ A !i_?i h* 

2 i 2 ”jk-iPjk 

where m_2 is the number of sample segments
in this set of counties; A_2 is the total
land area of counties in the group; p̂_{jk}
is the Landsat estimate of wheat
proportion in the k-th county; p_{jk} is the
agricultural census wheat proportion in
the k-th county; and p_j is the census
estimate for all counties in that group.

For the m_1 counties in the j-th
district which received no sample segments,
the area estimate is:

A = (A 2j + A 3j> „ 

A *j a A j 

A 2 A 3 

where x. is the agricultural census wheat 
area for the counties in this group, and
A_i is the total land area for all counties
in group i.

For each sampling plan, a standard 
deviation was computed for the estimate 
using four replications. Two sampling 
errors per plan and eight means per plan 
were available for statistical analysis. 

The analyses were performed using non- 
parametric techniques since the nonhomo- 
geneous variances did not satisfy the re- 
quirements for classical statistical test- 
ing. 
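
Nonparametric tests of this kind, such as the Kruskal-Wallis rank sum test reported in the results, work on ranks rather than raw values. The sketch below computes the rank sums and the H statistic in plain Python. It is an illustration only: the name `kruskal_wallis_h` and the sample values are hypothetical, tied values receive average ranks, and neither a tie correction nor the critical values for a multiple comparison test are included.

```python
def kruskal_wallis_h(groups):
    """Return (H, rank_sums) for a list of samples, without tie correction."""
    pooled = sorted((v, g) for g, sample in enumerate(groups) for v in sample)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j for a run of ties
        for _, g in pooled[i:j]:
            rank_sums[g] += avg_rank
        i = j
    weighted = sum(r * r / len(s) for r, s in zip(rank_sums, groups))
    h = 12 / (n * (n + 1)) * weighted - 3 * (n + 1)
    return h, rank_sums

# Hypothetical samples, one list per sampling scheme.
h, rank_sums = kruskal_wallis_h([[1, 2], [3, 4], [5, 6]])
print(round(h, 3), rank_sums)
```

For these three fully separated samples the rank sums are 3, 7, and 11, and H equals 32/7.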
Table 2. Comparison of Bias and Precision Associated with Different Sampling Schemes.

Number of  Sample                 Maximum    Average    Average Relative  Standard   Coefficient
Samples    Unit Size   Mean       Bias       Bias       Difference        Deviation  of Variation
                       (000 Ha)   (000 Ha)   (000 Ha)   (%)               (000 Ha)   (%)

75         5 x 6 nm    5550.9      498.2      127.5     2.4               223.7      4.0
137        4 x 4 nm    5365.0     -227.4      -58.4     1.1                86.3      1.6
560        2 x 2 nm    5409.6       80.5      -13.8     0.3                55.2      1.0
427,587    Pixel       5405.9      -39.1      -17.5     0.3                12.1      0.2
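
The coefficient of variation column of Table 2 can be checked as the standard deviation divided by the mean, expressed in percent. The sketch below assumes that the values in Table 2 pair up by row in the order in which they are listed; the dictionary name `table2` is an illustrative choice.

```python
# (mean, standard deviation) in thousand hectares, as listed in Table 2.
table2 = {
    "5 x 6 nm": (5550.9, 223.7),
    "4 x 4 nm": (5365.0, 86.3),
    "2 x 2 nm": (5409.6, 55.2),
    "Pixel": (5405.9, 12.1),
}

# Coefficient of variation in percent, rounded to one decimal as in the table.
cv_percent = {name: round(100 * sd / m, 1) for name, (m, sd) in table2.items()}
print(cv_percent)
```

The computed values of 4.0, 1.6, 1.0, and 0.2 percent match the last column of the table.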


IV. RESULTS AND DISCUSSION 


The effects of varying sampling unit
size and the number of samples are
illustrated in Figure 2 and are summarized
in Table 2. Qualitative and quantitative 
discussions of the precision and bias of 
the estimates follow. 

A. PRECISION

The results in Figure 2 show that
the use of larger sample unit sizes
results in a greater range and more
variability in the estimates. The standard
deviations obtained range from 11,300
hectares for pixel samples to 237,500
hectares for 5 x 6 nm segments (Table 2).
Coefficients of variation range from 0.2%
for pixel samples to 4.0% for 5 x 6 nm
segments. The variability associated with
the pixel samples is thus nearly
negligible, while the 4% variability associated
with one group of the 5 x 6 nm segments
does not seem to be negligible.

These observations are supported by
statistical results. A distribution-free
multiple comparison test based on the
Kruskal-Wallis rank sums was performed.⁴
This test was used to assess which pairs
of sample unit sizes, if any, had
significantly different sampling errors. At
the 5% level of significance, the only
pair of sampling unit sizes which had
significantly different standard
deviations was the 5 x 6 nm and pixel samples.

Figure 2. Comparison of Estimates
Associated with Different Sampling Schemes with
the Population Parameter (Horizontal
Line). Horizontal axis: segment size
(5 x 6 nm, 4 x 4 nm, 2 x 2 nm, pixel).

B. BIAS

The results presented in Figure 2
indicate that there may be some difference
in the means of estimates made using the
different sampling units. The means
range from 5,365,000 hectares to 5,550,900
hectares (Table 2). Unlike the standard
deviations, the means are not ranked in
order according to the sample unit size.

The horizontal line in Figure 2
represents the total number of hectares of
wheat in the classifications which were
sampled. This number is the true
population parameter which is to be estimated.
A large systematic bias is not indicated
since the population parameter falls in
the center portion of the range of the
estimates for all the sampling schemes,
rather than most of the observations being
either above or below the line. However,
as indicated in Table 2, the smaller
sampling units tend to yield estimates
which have less bias. The average
relative difference of pixel samples and
2 x 2 nm samples from the population
parameter was only 0.3%, while the 5 x 6
nm segments gave estimates with an average
relative difference of 2.4%.

Two types of nonparametric tests
were performed to assess the bias of the
several sampling methods. The
Kruskal-Wallis rank sum test for one-way
classifications was used to determine the effect
of sampling unit size on the area
estimates. No significant difference in the
means was found. The sign test was
performed on the estimates to determine if
the mean of any of the sampling schemes
was significantly different from the true
area of the data sampled.³ Again, no
statistically significant differences were
found.
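
A sign test of this kind can be sketched with an exact binomial calculation. The example is a minimal illustration, not the computation used in the study: the function name `sign_test_p` and the input values are hypothetical, and estimates equal to the reference value are dropped as ties.

```python
from math import comb

def sign_test_p(estimates, reference):
    """Two-sided exact sign test p-value against a reference value."""
    above = sum(e > reference for e in estimates)
    below = sum(e < reference for e in estimates)
    n = above + below  # ties are dropped
    if n == 0:
        return 1.0
    k = min(above, below)
    # Probability of k or fewer signs on the smaller side under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical case: eight estimates, five above and three below the reference.
p_value = sign_test_p([1, 1, 1, 1, 1, -1, -1, -1], 0)
print(p_value)  # 0.7265625
```

With eight estimates per plan, even a split as uneven as seven to one gives a two-sided p-value of about 0.07, so only a plan whose estimates all fall on one side of the true area would be flagged at the 5% level.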

Although none of the sampling schemes 
appeared to have a systematic bias, it 
is important to examine the maximum bias 
which was generated by each of the samp- 
ling schemes. The maximum bias was 
directly related to the sampling unit 
size. The maximum absolute bias for 
pixel samples was only about 39,000 hec- 
tares, while one 5 x 6 nm sample gave an 
overestimate of 498,000 hectares. 

In summary then, although no systematic
bias is present, it is important to
consider the maximum bias or range of
estimates which would be obtained using a
given sampling scheme in an operational
setting. In practice, sampling would be
conducted only once; thus, a one in eight
chance of obtaining a bias of 500,000
hectares may be a significant
consideration.

V. SUMMARY AND CONCLUSIONS 

The results of this investigation 
are well illustrated in Figure 2. The 
area estimates found by the use of
5 x 6 nm segments cover a much larger
range of values and thus have a larger
variability than any of the other segment
sizes. The estimates become more and more
precise as the segment size decreases and
more segments are taken. The estimates 
achieved using the 5x6 nm segments have 
the least precision of any sampling scheme 
tested. The precision of the 5 x 6 nm 
segments was significantly less than that 
of the pixel samples. 


None of the sampling schemes was sig- 
nificantly biased on the average, and none 
of the average estimates differed sig- 
nificantly from the population parameter. 
The maximum absolute bias, however, was
directly related to sampling unit size 
and should be considered in selection of a 
sampling unit. 

To assess the implications of the 
result of this study for operational use, 
other factors must be considered. In 
order to fully evaluate the scheme, the 
method of training and classification 
which would be used in conjunction with a 
sampling plan must also be considered. 

And, although the precision of estimates 
from choosing more but smaller segments 
may be higher, this gain in precision must
be weighed against the costs of sample 
selection and classification. 

A somewhat similar study was recently 
conducted by Perry.® The objective of 
that study was to ascertain the effect of 
a change in the sampling unit size on the 
total number of sampling units necessary 
to support a wheat production estimate 
with a specified coefficient of variation. 
The results obtained by Perry are suppor- 
tive of the conclusions of this investi- 
gation, but it was concluded that no 
recommendation for the optimal sampling 
unit size can be made until a model for 
the cost as a function of the sampling 
unit size is developed. 